It's Raining Cats and Dogs!

Lab Assignment Three: Exploring Image Data

Justin Ledford, Luke Wood, Traian Pop

Business Understanding

Overview

The data set we are analyzing contains over 25,000 evenly distributed pictures of dogs and cats. We found it at https://www.kaggle.com/c/dogs-vs-cats and decided it would be a good set for practicing the processing and analysis of image data, due to the clear dog-versus-cat classification target and the large number of high-quality photos.

Purpose

This data was originally collected for a competition hosted by Kaggle to see who could create the most accurate algorithm for distinguishing between pictures of dogs and cats. Petfinder.com is the original owner of the data, and it donated over 3 million hand-classified pictures of sheltered animals. Because of this, we are highly confident in the labels, as the vast majority of humans can successfully distinguish between a dog and a cat.

The competition itself is long over, but the page remains for people to experiment with the data. The data is important due to the potential it holds for developing and training image recognition algorithms. We can progressively check how useful the information we extract from the set is by comparing it against our own classifications and seeing whether our conclusions (i.e., whether an image contains a cat) match the information we collect from it.

Predictions

We are not expecting perfect results, as cats and dogs can look fairly similar at times, and the greyscale format we convert the pictures to removes easily differentiable characteristics such as fur color. However, due to the large number and quality of the pictures, we expect a large proportion of successful results when processing and analyzing the data.

Applications

This dataset could be used to train an image classifier for hotels or apartment complexes that only allow one type of pet. For example, Luke's apartment complex only allows cats, so the complex could theoretically place cameras in the hallways to detect which type of pet is being brought in. This would allow it to more efficiently enforce its no-dogs rule, which is frequently broken.

Another application is cataloging at a pet shelter, allowing prospective owners to search for cats or dogs similar to one they are looking for.

A third application would be to track the history of animals by taking pictures of them at each shelter and documenting information about the animal at each stage.

Data Preparation

In this section, we pre-process the data so that it is usable by our techniques. The first method below shows how we take all the images in a given folder (after they have been modified by our own scripts) and append them together for later analysis.

In pre-processing the data, we used three separate scripts:

1) aspect: Script developed by Fred Weinhaus (http://www.fmwconcepts.com/imagemagick/aspect/index.php) that resizes an image to a specific size using ImageMagick (https://www.imagemagick.org/), allowing either cropping or padding to deal with the aspect ratio change. (Not shown due to length.)

2) process.sh: Script that shrinks images to 150 x 150 while using aspect to preserve the aspect ratio. Uses padding to fill the leftover space, so all original content is kept, but noise is added at the edges where the blanks are filled.

3) crop_process.sh: Script that shrinks images to 150 x 150 while using aspect to preserve the aspect ratio. Instead of padding, it crops the image with respect to the middle of the original. This loses some detail, but no additional noise is added.
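The two strategies can be sketched in Python with Pillow as a hedged stand-in for the actual ImageMagick-based scripts (function names here are our own, and `ImageOps.pad` assumes a reasonably recent Pillow):

```python
from PIL import Image, ImageOps

SIZE = (150, 150)

def pad_resize(img, size=SIZE):
    # process.sh analogue: shrink while keeping the aspect ratio,
    # then fill the leftover space with black padding
    return ImageOps.pad(img.convert("L"), size, color=0)

def crop_resize(img, size=SIZE):
    # crop_process.sh analogue: shrink, then center-crop the overflow
    return ImageOps.fit(img.convert("L"), size)
```

Both return 150 x 150 greyscale images; only the treatment of the aspect-ratio mismatch differs.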

After we preprocess the images with aspect and one of the scripts above, we append them all together in one dataframe.

In [1]:
import numpy as np
from PIL import Image
import pandas as pd
import os
import glob
import plotly
import matplotlib.pyplot as plt
%matplotlib inline 

#ignore warnings
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
warnings.simplefilter('ignore', FutureWarning)

def warn(*args, **kwargs):
    pass
warnings.warn = warn

w,h = 150,150

def wildcard_to_df(fpath, greyscale=True):
    X = []
    labels = []
    
    for img_file in glob.iglob(fpath):
        # Read in img as greyscale
        with Image.open(img_file).convert("L") as img:

            # Keep labels for each image in separate array
            if 'cat' in img_file:
                labels.append(0)
            else:
                labels.append(1)

            # Flatten the greyscale pixels into one row and collect
            X.append(np.concatenate(np.array(img)))

    X = np.array(X)
    labels = np.array(labels)

    #convert to DF
    return pd.DataFrame(data=X, index=labels), X, labels

Padded Images

Initially we pre-processed the data by shrinking the images while keeping the aspect ratio similar, filling the empty space remaining in the 150 x 150 frame with black padding. We assumed this was the best solution because it preserves the original image proportions, and we didn't initially take into account the effect the padding would have on the PCA.

In [2]:
df, X, labels = wildcard_to_df('data/barred/*.jpg')

The function below prints out the images in an organized way for us to examine them and see if they pre-processed correctly.

In [3]:
# a helper plotting function (modified code from Professor Eric Larson's repository)
def plot_gallery(images, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.xticks(())
        plt.yticks(())

plot_gallery(X, h, w) # defaults to showing a 3 by 6 subset of the images

# Visualize some images
plt.show()

We set the initial number of components to 250 in order to safely overestimate the number it takes to accurately and efficiently represent our image data. This way, we can then analyze the explained variance to find a more appropriate component count.
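That later analysis can be sketched as a small helper (a generic utility, not part of the original notebook) that reads the needed component count off the cumulative explained variance:

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.90):
    """Smallest number of leading components whose cumulative
    explained variance reaches `threshold`."""
    cum = np.cumsum(explained_variance_ratio)
    return int(np.searchsorted(cum, threshold) + 1)
```

With a fitted PCA, `components_for_variance(pca.explained_variance_ratio_)` should reproduce component counts like the ones we report from the plots.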

Using PCA

In [4]:
from sklearn.decomposition import PCA

n_components = 250
pca = PCA(n_components=n_components)
%time pca.fit(X)
eigens = pca.components_.reshape((n_components, h, w))
CPU times: user 49.4 s, sys: 1.37 s, total: 50.8 s
Wall time: 14.1 s

This function visualizes the explained variance of each principal component from the linear dimensionality reduction (LDR) performed above, both individually and cumulatively. We use this information to see how many dimensions are needed to get an accurate representation of our data set while still maintaining efficiency.

In [5]:
def plot_explained_variance(pca):
    from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })

First, we plot the explained variance for the padded images.

In [6]:
plot_explained_variance(pca)

From the graph, it seems that the first component represents only about 18.9% of the variance, and it takes 171 components to reach 90% cumulative variance. When printing the eigen images afterwards, this is what we produce.

In [7]:
plot_gallery(eigens, h, w)
plt.show()

The issues with the current PCA

After running principal component analysis on our pre-processed images, we realized that we had a serious issue: almost every component represented either dark edges or light edges. These edges are not at all representative of the images themselves; instead, they reflect which images were pre-processed with dark borders versus light borders. We realized that we needed a different way to format the images, so we tried cropping the data instead.
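This could also be confirmed numerically by measuring how much of each component's energy lies in the border region. A hypothetical check we did not run (the border width of 10 pixels is an arbitrary choice):

```python
import numpy as np

def border_energy_fraction(component, h, w, border=10):
    # Fraction of a component's squared magnitude that lies in the
    # outer `border` pixels -- components dominated by the padding
    # artifact should score close to 1.
    comp = component.reshape(h, w) ** 2
    total = comp.sum()
    interior = comp[border:h - border, border:w - border].sum()
    return float((total - interior) / total)
```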


Cropped Images

After some debate, we decided to try cropping the images to 150 x 150 with respect to the middle of the original picture. Although we lose parts of the image and therefore some quality, we predicted that we would get better results with the PCA due to the absence of noise from the padding.

In [8]:
#convert to dataframe
df, X, labels = wildcard_to_df('data/cropped/*.jpg')
In [9]:
plot_gallery(X, h, w) # defaults to showing a 3 by 6 subset of the images

# Visualize some images
plt.show()

It is already noticeable how much easier the data is to look at, at least for a human. However, it is also easy to see that some key features (like the face) are lost, as in row 2, image 4 (from the left).

We will be performing the same tests as we did for padding.

In [10]:
n_components = 250
pca = PCA(n_components=n_components)
%time pca.fit(X)
eigens = pca.components_.reshape((n_components, h, w))
CPU times: user 45.1 s, sys: 1.4 s, total: 46.5 s
Wall time: 13.1 s
In [11]:
plot_explained_variance(pca)

This time, the first component represents about 19.0% of the variance, but it takes 205 components to reach 90% cumulative variance. When plotting the eigen faces afterwards, this is what we produce.

In [12]:
plot_gallery(eigens, h, w)
plt.show()

Although these eigen faces look more like nightmares than pets, they are at least slightly more recognizable as faces than the previous set from the padded images. In the end, we chose to stay with the cropped images because they add minimal noise.

Here is a summary of our final set of pre-processed data.

In [13]:
samples, features = X.shape
classes = len(np.unique(labels))

print("Number of samples in dataset: {}".format(samples))
print("Number of features in dataset: {}".format(features))
print("Number of classes in dataset: {}".format(classes))
print("Original Image Sizes {} by {}".format(h,w))
Number of samples in dataset: 2000
Number of features in dataset: 22500
Number of classes in dataset: 2
Original Image Sizes 150 by 150

Data Reduction

PCA and Randomized PCA

In the previous section we performed and analyzed the linear dimensionality reduction using full PCA, but we did not yet run Randomized PCA. RPCA is typically used to manage larger data sets, but since our data set is fairly small, it should not affect the final outcome significantly.

In [14]:
from sklearn.decomposition import RandomizedPCA

n_components = 250
rpca = RandomizedPCA(n_components=n_components)
%time rpca.fit(X)
eigenfaces_r = rpca.components_.reshape((n_components, h, w))
CPU times: user 11 s, sys: 835 ms, total: 11.8 s
Wall time: 3.87 s
In [15]:
plot_explained_variance(rpca)

The Randomized PCA also had a top component explaining about 19.0% of the variance, but it reached 90% cumulative variance at 209 components instead of the 205 needed by full PCA. Next we will display the eigenfaces to see if there is any significant difference; given how close these two numbers are, we predict there will be very little, if any, noticeable difference.

In [16]:
plot_gallery(eigenfaces_r, h, w)

As we predicted, the differences were minuscule. Now, we will compare the reconstruction capabilities between PCA and RPCA. The following method is used to reconstruct an image.

In [17]:
def reconstruct_image(trans_obj,org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
    
idx_to_reconstruct = 5
low_dimensional_representation, reconstructed_image = reconstruct_image(pca,X[idx_to_reconstruct].reshape(1, -1))
low_dimensional_representation_r, reconstructed_image_r = reconstruct_image(rpca,X[idx_to_reconstruct].reshape(1, -1))
In [18]:
plt.figure(figsize=(15,7))

plt.subplot(1,3,1)
plt.imshow(X[idx_to_reconstruct].reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid()

plt.subplot(1,3,2)
plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from Full PCA')
plt.grid()

plt.subplot(1,3,3)
plt.imshow(reconstructed_image_r.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from Randomized PCA')
plt.grid()

plt.tight_layout()

The reconstructions from full PCA and Randomized PCA look almost identical, and it is impressive how close both come to the original image. The image quality difference matches previous tests done on other data, so we are confident that this is the expected end product of reconstruction from PCA. Next, we will look at another way to reduce the dimensions of our images.
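Beyond eyeballing the plots, the two solvers can be compared numerically by their reconstruction error. A sketch on synthetic stand-in data (`X_demo` is random, not our image matrix; note that in current scikit-learn, `RandomizedPCA` has been folded into `PCA(svd_solver='randomized')`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.rand(100, 50)  # synthetic stand-in for the image matrix

def reconstruction_error(model, X):
    # mean squared error between the data and its low-rank reconstruction
    Xr = model.inverse_transform(model.transform(X))
    return float(np.mean((X - Xr) ** 2))

pca_full = PCA(n_components=10, svd_solver='full').fit(X_demo)
pca_rand = PCA(n_components=10, svd_solver='randomized',
               random_state=0).fit(X_demo)

err_full = reconstruction_error(pca_full, X_demo)
err_rand = reconstruction_error(pca_rand, X_demo)
```

The two errors should agree closely, matching what we see visually.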


Kernel PCA

Kernel PCA builds upon PCA, but instead of performing a linear dimensionality reduction, it performs a non-linear one. The data is implicitly mapped by a kernel function into a high-dimensional feature space (a Hilbert space), and linear PCA is performed there. This means that in some situations Kernel PCA can capture relationships that are not linearly visible in the original space, and thus can produce better results. One downside, at least with the RBF kernel we use, is that it relies heavily on the "gamma" parameter passed to KernelPCA(). The value is fairly arbitrary and requires some guesswork before producing good results.

In [19]:
from sklearn.decomposition import KernelPCA

n_components = 250
kpca = KernelPCA(n_components=n_components, kernel='rbf', 
                fit_inverse_transform=True, gamma=15) # very sensitive to the gamma parameter
%time kpca.fit(X)
CPU times: user 11.7 s, sys: 371 ms, total: 12.1 s
Wall time: 3.41 s
Out[19]:
KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
     fit_inverse_transform=True, gamma=15, kernel='rbf',
     kernel_params=None, max_iter=None, n_components=250,
     remove_zero_eig=False, tol=0)
In [20]:
import warnings
warnings.simplefilter('ignore', DeprecationWarning)

from ipywidgets import widgets 

def plt_reconstruct(idx_to_reconstruct):
    idx_to_reconstruct = int(np.round(idx_to_reconstruct))
    
    # transform expects a 2D array, so reshape the single image to (1, -1)
    sample = X[idx_to_reconstruct].reshape(1, -1)
    reconstructed_image = pca.inverse_transform(pca.transform(sample))
    reconstructed_image_rpca = rpca.inverse_transform(rpca.transform(sample))
    reconstructed_image_kpca = kpca.inverse_transform(kpca.transform(sample))
    
    
    plt.figure(figsize=(15,7))
    
    plt.subplot(1,4,1)
    plt.imshow(X[idx_to_reconstruct].reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Original')
    plt.grid()
    
    plt.subplot(1,4,2)
    plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Full PCA')
    plt.grid()
    
    plt.subplot(1,4,3)
    plt.imshow(reconstructed_image_rpca.reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Randomized PCA')
    plt.grid()
    
    plt.subplot(1,4,4)
    plt.imshow(reconstructed_image_kpca.reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Kernel PCA')
    plt.grid()

plt_reconstruct(4)
widgets.interact(plt_reconstruct,idx_to_reconstruct=(0,samples-1,1),__manual=True)
Out[20]:
<function __main__.plt_reconstruct>

Although PCA and RPCA look fairly similar, as before, we can see that Kernel PCA is considerably more accurate, to the point of being almost indistinguishable from the original, at least to the human eye. In terms of quality, Kernel PCA definitely has the upper hand in most cases.

However, there are a few cases where, due to an improper gamma value, Kernel PCA is actually much worse than its counterparts.

In [21]:
n_components = 250
kpca = KernelPCA(n_components=n_components, kernel='rbf', 
                fit_inverse_transform=True, gamma=1) # very sensitive to the gamma parameter
%time kpca.fit(X)
plt_reconstruct(4)
widgets.interact(plt_reconstruct,idx_to_reconstruct=(0,samples-1,1),__manual=True)
CPU times: user 10.7 s, sys: 303 ms, total: 11 s
Wall time: 3.11 s
Out[21]:
<function __main__.plt_reconstruct>

For example, take the above reconstruction: as you flip through the images (index 2 is an example), you can see that Kernel PCA is not always the best choice. However, even though it can be inconsistent depending on the gamma value, we still prefer it over the other methods due to how accurate it can be when it works. While it takes more effort to get it to consistently produce good output, we believe the quality of the reconstructions is worth it.
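One pragmatic way to reduce the gamma guesswork is a small sweep that scores each candidate by reconstruction error. A sketch on synthetic stand-in data (the candidate grid is an arbitrary choice of ours):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.RandomState(0)
X_demo = rng.rand(80, 20)  # synthetic stand-in for the image matrix

def kpca_reconstruction_error(X, gamma, n_components=10):
    # fit an RBF kernel PCA and score how well it reconstructs the data
    kpca = KernelPCA(n_components=n_components, kernel='rbf',
                     fit_inverse_transform=True, gamma=gamma)
    Xr = kpca.inverse_transform(kpca.fit_transform(X))
    return float(np.mean((X - Xr) ** 2))

errors = {g: kpca_reconstruction_error(X_demo, g)
          for g in (0.01, 0.1, 1.0, 10.0)}
best_gamma = min(errors, key=errors.get)
```

On the real image data one would sweep a grid around the values tried above (gamma = 1 and 15) and keep the best scorer.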


Feature Extraction

In this part we will be extracting features from our reconstructed data. First, we start by using Sobel filters for edge detection.

In [22]:
from skimage.io import imshow

idx_to_reconstruct = int(np.random.rand(1)*len(X))
img  = X[idx_to_reconstruct].reshape((h,w))
imshow(img)
plt.grid()
In [23]:
from skimage.filters import sobel_h, sobel_v

gradient_mag = np.sqrt(sobel_v(img)**2 + sobel_h(img)**2 ) 
imshow(gradient_mag)
plt.grid()

By detecting the edges in the pictures, we can better see the differentiating features that define each image. However, because our images contain a lot of background that is irrelevant to the subject we are trying to classify, this may not be the best feature extraction method. Next, we will try the DAISY method.


Daisy

DAISY is used for bag-of-features image representations, meaning it extracts local features from images and treats them the way the bag-of-words model treats words. It does this by creating histograms of local image gradients. The next steps show how it can be used for the highly practical purpose of matching similar images based on their feature data.

In [24]:
from skimage.feature import daisy

features, img_desc = daisy(img,step=40, radius=10, rings=3, histograms=5, orientations=8, visualize=True)
imshow(img_desc)
plt.grid()
In [25]:
features = daisy(img,step=10, radius=10, rings=2, histograms=4, orientations=8, visualize=False)
print(features.shape)
print(features.shape[0]*features.shape[1]*features.shape[2])
(13, 13, 72)
12168
In [26]:
def apply_daisy(row,shape):
    feat = daisy(row.reshape(shape),step=10, radius=10, rings=2, histograms=6, orientations=8, visualize=False)
    return feat.reshape((-1))

%time test_feature = apply_daisy(X[3],(h,w))
test_feature.shape
CPU times: user 24.3 ms, sys: 4.02 ms, total: 28.3 ms
Wall time: 27.4 ms
Out[26]:
(17576,)
In [27]:
0.019 * len(X) # approximate how long it may run
Out[27]:
38.0
In [28]:
%time daisy_features = np.apply_along_axis(apply_daisy, 1, X, (h,w))
print(daisy_features.shape)
CPU times: user 1min 4s, sys: 8.94 s, total: 1min 13s
Wall time: 1min 14s
(2000, 17576)

After calculating the daisy features of each image, we can find the pairwise distances between each image in order to try and find similar images.

In [29]:
from sklearn.metrics.pairwise import pairwise_distances
# find the pairwise distance between all the different image features
%time dist_matrix = pairwise_distances(daisy_features)
CPU times: user 3.34 s, sys: 117 ms, total: 3.45 s
Wall time: 974 ms
In [30]:
import copy
# find closest image to current image
idx1 = 5
distances = copy.deepcopy(dist_matrix[idx1,:])
distances[idx1] = np.infty # dont pick the same image!
idx2 = np.argmin(distances)

plt.figure(figsize=(7,10))
plt.subplot(1,2,1)
imshow(X[idx1].reshape((h,w)))
plt.title("Original Image")
plt.grid()

plt.subplot(1,2,2)
imshow(X[idx2].reshape((h,w)))
plt.title("Closest Image")
plt.grid()
In [31]:
from ipywidgets import fixed
from ipywidgets import widgets  # make this interactive!

# put it together inside a nice widget
def closest_image(dmat,idx1):
    distances = copy.deepcopy(dmat[idx1,:]) # get all image distances
    distances[idx1] = np.infty # dont pick the same image!
    idx2 = np.argmin(distances)
    
    distances[idx2] = np.infty
    idx3 = np.argmin(distances)
    
    plt.figure(figsize=(10,16))
    plt.subplot(1,3,1)
    imshow(X[idx1].reshape((h,w)))
    plt.title("Original Image ")
    plt.grid()

    plt.subplot(1,3,2)
    imshow(X[idx2].reshape((h,w)))
    plt.title("Closest Image  ")
    plt.grid()
    
    plt.subplot(1,3,3)
    imshow(X[idx3].reshape((h,w)))
    plt.title("Next Closest Image ")
    plt.grid()

#closest_image(idx1)
widgets.interact(closest_image,idx1=(0,samples-1,1),dmat=fixed(dist_matrix),__manual=True)
Out[31]:
<function __main__.closest_image>

As you can see, the DAISY method is quite effective for feature extraction: although it has a hard time distinguishing a dog from a cat, it is at least consistent regarding position, shape, and other characteristics shared by the pictures it matches. This kind of matching works well on human faces because it picks up all the little details that make each face unique.


Gabor Kernels for Feature Extraction

For our last feature extraction technique, we use Gabor kernels, which specialize in texture representation and discrimination. We believe that, due to the animals' different fur textures, this extraction method will be more accurate than the two previous ones.

In [32]:
#gabor
from skimage.filters import gabor_kernel
from scipy import ndimage as ndi
from scipy import stats

# prepare filter bank kernels
kernels = []
for theta in range(4):
    theta = theta / 4. * np.pi
    for sigma in (1, 3):
        for frequency in (0.05, 0.25):
            kernel = np.real(gabor_kernel(frequency, theta=theta,
                                          sigma_x=sigma, sigma_y=sigma))
            kernels.append(kernel)

            
# compute the filter bank and take statistics of image
def compute_gabor(row, kernels, shape):
    feats = np.zeros((len(kernels), 4), dtype=np.double)
    for k, kernel in enumerate(kernels):
        filtered = ndi.convolve(row.reshape(shape), kernel, mode='wrap')
        _,_,feats[k,0],feats[k,1],feats[k,2],feats[k,3] = stats.describe(filtered.reshape(-1))
        # mean, var, skew, kurt
        
    return feats.reshape(-1)

idx_to_reconstruct = int(np.random.rand(1)*len(X))

gabr_feature = compute_gabor(X[idx_to_reconstruct], kernels, (h,w))
gabr_feature
Out[32]:
array([  1.04594089e+02,   1.46668304e+03,  -4.70611179e-01,
        -3.17556737e-01,   3.26994667e+01,   3.53562994e+02,
         6.47362849e+00,   7.49852551e+01,   7.03969333e+01,
         5.71934485e+02,  -6.17988435e-01,  -1.04348479e-01,
         2.59503556e+01,   5.85989500e+03,   2.64428044e+00,
         4.99397603e+00,   1.04595956e+02,   1.46630836e+03,
        -4.69729885e-01,  -3.18625060e-01,   3.20183111e+01,
         2.19778233e+02,   4.25967007e+00,   6.61017472e+01,
         7.10085333e+01,   5.71508794e+02,  -5.76092254e-01,
        -1.68385731e-01,   2.26701778e+01,   5.21992050e+03,
         2.89675249e+00,   6.39207009e+00,   1.04593911e+02,
         1.46628457e+03,  -4.69252131e-01,  -3.20085230e-01,
         3.18242222e+01,   1.92398053e+02,   2.83596323e+00,
         4.88186352e+01,   7.03926667e+01,   5.60567216e+02,
        -5.52886641e-01,  -2.49205479e-01,   2.86193778e+01,
         6.39718273e+03,   2.46674718e+00,   4.08586979e+00,
         1.04596756e+02,   1.46655649e+03,  -4.69974900e-01,
        -3.18955984e-01,   3.18782222e+01,   1.95079263e+02,
         2.96280474e+00,   5.05806305e+01,   7.10070222e+01,
         5.72781230e+02,  -5.92631341e-01,  -1.87197406e-01,
         2.45613333e+01,   5.59980445e+03,   2.74428965e+00,
         5.53243117e+00])
In [33]:
# takes ~3 minutes to run entire dataset
%time gabor_stats = np.apply_along_axis(compute_gabor, 1, X, kernels, (h,w))
print(gabor_stats.shape)
CPU times: user 3min 23s, sys: 1.92 s, total: 3min 25s
Wall time: 3min 30s
(2000, 64)

We can calculate the pairwise distances again using the Gabor Filter statistics this time.

In [34]:
from sklearn.metrics.pairwise import pairwise_distances
# find the pairwise distance between all the different image features
%time dist_matrix = pairwise_distances(gabor_stats)
CPU times: user 175 ms, sys: 13.4 ms, total: 188 ms
Wall time: 61 ms
In [35]:
widgets.interact(closest_image,idx1=(0,samples-1,1),dmat=fixed(dist_matrix),__manual=True)
Out[35]:
<function __main__.closest_image>

After analyzing our results and comparing them to DAISY, we can tentatively conclude that, for animal classification, the Gabor kernel method is more accurate: when the original image is a cat, the two closest images tend to include at least one cat, and often two. This analysis was done only by trial and error, so no conclusive statement can be made yet.

However, it does make sense that a feature extraction method specializing in texture would be accurate when it comes to these animals, as the two species typically have very distinct fur patterns. When using this method on the human data set from class, this extraction often predicted closest images outside the original image's class, which makes sense considering humans have fairly similar skin texture among themselves.


Visualizing differences

In this section we shall use all the feature extraction data we collected and visualize differences in our image set. We hope to be able to cluster two separate groups of points in order to demonstrate the separation between cats and dogs.

Heatmaps

To visualize the differences between some instances in each class, we decided to plot the pairwise distances computed from the Gabor filters as heat maps. We chose a sample of 30 images from each class to compare to each other.

In [36]:
import seaborn as sns
In [37]:
sample_size = 30
cat_sample_indices = [i for i,x in enumerate(labels) if x == 0][:sample_size]
dog_sample_indices = [i for i,x in enumerate(labels) if x == 1][:sample_size]

cat_dist = dist_matrix[:, cat_sample_indices][cat_sample_indices, :]
dog_dist = dist_matrix[:, dog_sample_indices][dog_sample_indices, :]

max_dist = np.amax(dist_matrix)

First we plot pairwise distances of the first 30 cat images in our set.

In [38]:
sns.heatmap(cat_dist, vmax=max_dist)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x115e59828>

From the plot we can see that many images have relatively small distances from each other, especially images 9-25.

Next we do the same but with dog images.

In [39]:
sns.heatmap(dog_dist, vmax=max_dist)
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x11f9ec898>

We see the same trend in this heatmap as well.

Finally we plot the pairwise distances of both classes on the same heatmap.

In [40]:
g = sns.heatmap(dist_matrix[:,np.arange(0,sample_size)][np.arange(0,sample_size),:],
                vmax=max_dist)
graph_labels = ["dog" if i == 1 else "cat" for i in labels[:sample_size]]
g.set(xticklabels=graph_labels, yticklabels=reversed(graph_labels))
#plt.yticks(rotation=30)
plt.xticks(rotation=90)
Out[40]:
(array([  0.5,   1.5,   2.5,   3.5,   4.5,   5.5,   6.5,   7.5,   8.5,
          9.5,  10.5,  11.5,  12.5,  13.5,  14.5,  15.5,  16.5,  17.5,
         18.5,  19.5,  20.5,  21.5,  22.5,  23.5,  24.5,  25.5,  26.5,
         27.5,  28.5,  29.5]), <a list of 30 Text xticklabel objects>)

Plotting instances from both classes shows larger distances than plotting instances from the same class. This suggests that the statistics from the Gabor filters can give us valuable data for classifying between cats and dogs.
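This claim can also be checked numerically by averaging distances within each class versus across classes. A small helper we sketch here (not part of the original analysis); with the Gabor distance matrix one would call `class_distance_summary(dist_matrix, labels)`:

```python
import numpy as np

def class_distance_summary(dist_matrix, labels):
    # mean pairwise distance within a class vs. across classes
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = dist_matrix[same & off_diag].mean()
    between = dist_matrix[~same].mean()
    return within, between
```

If the heatmap pattern holds, `within` should come out smaller than `between`.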

Scatterplots

Next we decided to use scatterplots to compare the pairwise distances between each image.

In [41]:
plot_X = pairwise_distances(gabor_stats)
In [42]:
features = range(0, 2000)
In [43]:
dfplot = pd.DataFrame(data=plot_X, columns=features)
In [44]:
print (dfplot.info())
dfplot.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Columns: 2000 entries, 0 to 1999
dtypes: float64(2000)
memory usage: 30.5 MB
None
Out[44]:
0 1 2 3 4 5 6 7 8 9 ... 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
0 0.000000 13693.604244 12808.263856 7112.417662 11439.157931 6249.262636 8085.522193 10900.943648 11194.026582 3403.470046 ... 19913.980654 13548.552785 9580.118319 12991.009445 10352.967333 12061.183533 10072.184672 16519.799588 9030.346441 5169.975075
1 13693.604244 0.000000 4166.141983 6799.345344 5993.979062 7746.235043 9183.213634 12988.372416 2936.561023 10875.177373 ... 8902.703717 3387.507325 4297.311470 2324.370006 3636.313047 2421.731189 4641.890285 3660.795942 5102.412496 9756.644288
2 12808.263856 4166.141983 0.000000 6613.208967 2588.614025 7245.824246 10785.715424 9308.526859 3630.191643 10326.583589 ... 7556.382632 2159.380879 5213.100012 3848.969969 4287.995756 2862.575313 4879.866386 4993.138872 5399.827550 10234.047803
3 7112.417662 6799.345344 6613.208967 0.000000 5934.793887 1110.047982 5464.838507 9856.147154 4441.140010 4546.047617 ... 13642.088844 7048.549315 2932.537467 6137.982517 3876.802566 5576.915719 3768.337774 9563.816616 3273.375832 4266.400516
4 11439.157931 5993.979062 2588.614025 5934.793887 0.000000 6216.337042 10549.065143 7387.073805 4817.109097 9347.391018 ... 9053.850976 3798.704191 5687.355808 5660.744507 4665.166855 4333.865799 4682.529975 7010.475899 5397.097082 9651.098002

5 rows × 2000 columns

In [45]:
dfplot_fix = dfplot.drop(dfplot.columns[5:2000], axis=1)
print (dfplot_fix.info())
dfplot_fix.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 5 columns):
0    2000 non-null float64
1    2000 non-null float64
2    2000 non-null float64
3    2000 non-null float64
4    2000 non-null float64
dtypes: float64(5)
memory usage: 78.2 KB
None
Out[45]:
0 1 2 3 4
0 0.000000 13693.604244 12808.263856 7112.417662 11439.157931
1 13693.604244 0.000000 4166.141983 6799.345344 5993.979062
2 12808.263856 4166.141983 0.000000 6613.208967 2588.614025
3 7112.417662 6799.345344 6613.208967 0.000000 5934.793887
4 11439.157931 5993.979062 2588.614025 5934.793887 0.000000
In [46]:
dfplot_fix['class'] = labels.astype(int)
In [47]:
print (dfplot_fix.info())
dfplot_fix.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 6 columns):
0        2000 non-null float64
1        2000 non-null float64
2        2000 non-null float64
3        2000 non-null float64
4        2000 non-null float64
class    2000 non-null int64
dtypes: float64(5), int64(1)
memory usage: 93.8 KB
None
Out[47]:
0 1 2 3 4 class
0 0.000000 13693.604244 12808.263856 7112.417662 11439.157931 0
1 13693.604244 0.000000 4166.141983 6799.345344 5993.979062 0
2 12808.263856 4166.141983 0.000000 6613.208967 2588.614025 0
3 7112.417662 6799.345344 6613.208967 0.000000 5934.793887 0
4 11439.157931 5993.979062 2588.614025 5934.793887 0.000000 0
In [48]:
sns.set()
sns.pairplot(dfplot_fix, hue="class", size=2)
Out[48]:
<seaborn.axisgrid.PairGrid at 0x11da7df28>

Biplot

In [49]:
X = dfplot_fix
X = X.replace([np.inf, -np.inf], np.nan)  # replace returns a new DataFrame
X_fix = X.fillna(X.median())  # fill remaining NaNs with column medians
In [50]:
# Manipulated example from https://github.com/teddyroland/python-biplot/blob/master/biplot.py

def biplot(pca, dat, title=''):
    
    import plotly
    from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    # 0,1 denote PC1 and PC2; change values for other PCs
    xvector = pca.components_[0] 
    yvector = pca.components_[1]
    
    tmp = pca.transform(dat.values)
    xs = tmp[:,0] 
    ys = tmp[:,1]

    annotations = [Scatter(x=xs, y=ys, mode='markers', name='observations')]
    for i in range(len(xvector)):
        txt = list(dat.columns.values)[i]
        annotations.append(
                Scatter(
                    x=[0, xvector[i]*max(xs)],
                    y=[0, yvector[i]*max(ys)],
                    mode='lines+text',
                    text=['', txt],
                    name=txt,
                ))
    
    plotly.offline.iplot({
        "data": annotations,
        "layout": Layout(xaxis=XAxis(title='Principal Component One'), 
                         yaxis=YAxis(title='Principal Component Two'),
                        title=title)
    })



X_pca = PCA(n_components=6)
X_pca.fit(X_fix)
biplot(X_pca,X_fix)
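Since the biplot only shows the first two principal components, it is worth checking how much of the variance those two axes actually capture. A minimal sketch of that check, using random stand-in data (`X_demo` is hypothetical, not our actual distance matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for X_fix: 100 rows x 6 columns of random values
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 6)

pca = PCA(n_components=6)
pca.fit(X_demo)

# Cumulative explained variance: how much of the total variance the
# biplot's first two axes (and each subsequent one) account for
cumvar = np.cumsum(pca.explained_variance_ratio_)
print(cumvar.round(3))
```

If the first two entries of `cumvar` are small, the 2-D biplot is only a rough summary of the data.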

KernelPCA components scatterplot

In this part we produce a scatterplot of the top KernelPCA components instead of the regular linear PCA, to see whether the nonlinear projection organizes the data differently than our earlier examinations did.

In [51]:
df, X, labels = wildcard_to_df('data/cropped/*.jpg')
kpca = KernelPCA(n_components=250)
X_kpca = kpca.fit_transform(X)
In [52]:
df_kpca = pd.DataFrame(X_kpca[:,:4])
df_kpca['class'] = labels.astype(int)  # np.int is removed in recent NumPy; use the builtin
sns.pairplot(df_kpca, hue='class')
plt.legend(['cats', 'dogs'])
Out[52]:
<matplotlib.legend.Legend at 0x1250cbcf8>

It is interesting that when plotting the top 4 components from KPCA, the instances cluster together, but the cat instances seem to spread more than the dog instances. This could provide information for classifying between the two.
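The "cats spread more than dogs" impression from the pairplot can be quantified by computing per-class standard deviations of each component. A small sketch with synthetic stand-in data (`df_demo` is hypothetical; with the real notebook state one would group `df_kpca` instead):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_kpca: class 0 drawn with a wider spread
rng = np.random.RandomState(0)
comp = np.vstack([rng.normal(0, 3.0, size=(100, 4)),   # class 0: wide spread
                  rng.normal(0, 1.0, size=(100, 4))])  # class 1: tight spread
df_demo = pd.DataFrame(comp, columns=[0, 1, 2, 3])
df_demo['class'] = [0] * 100 + [1] * 100

# Per-class standard deviation of each component
spread = df_demo.groupby('class').std()
print(spread)
```

A consistently larger row for one class would back up the visual impression that its instances are more dispersed in the KPCA space.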


Cool Stuff

Reverse Image Searching

We decided to simulate the scenario of someone searching for a pet in a shelter. The images in our data set represent animals currently in the shelter, and we chose a picture of a cat from online to represent the cat being searched for.

Here is the type of cat we are looking for:

[Image: the cat we are searching for]

We then reformatted this image to match our data set's format and added it to the set, which gave us this image:

[Image: the same cat after reformatting]

Here is the code used to find similar cats:

In [53]:
def search_for_cat(img_path):
    df0,X0,_=wildcard_to_df(img_path)
    gabor_search = compute_gabor(X0,kernels,(h,w))
    complete_gabor_stats = np.concatenate((gabor_search.reshape(1,-1),gabor_stats))
    dmat = pairwise_distances(complete_gabor_stats)
    idx1 = 0 # the search image was prepended as row 0
    distances = copy.deepcopy(dmat[idx1,:]) # distances from the search image to every image
    distances[idx1] = np.inf # don't pick the search image itself!
    idx2 = np.argmin(distances)
    
    distances[idx2] = np.inf
    idx3 = np.argmin(distances)
    
    plt.figure(figsize=(10,16))
    plt.subplot(1,3,1)
    imshow(X0.reshape((h,w)))
    plt.title("Original Image ")
    plt.grid()

    plt.subplot(1,3,2)
    imshow(X[idx2-1].reshape((h,w))) # subtract 1: rows of X are offset by the prepended search image
    plt.title("Closest Image  ")
    plt.grid()
    
    plt.subplot(1,3,3)
    imshow(X[idx3-1].reshape((h,w))) # same offset of 1 as above
    plt.title("Next Closest Image ")
    plt.grid()
    
search_for_cat('img/orange-cat-cropped.jpg')

In this case the reverse image search performed poorly at retrieving the right species, though the matches are visually plausible: the animals in all three images are centered in the frame and standing upright, and the closest image has a fur texture similar to the cat's. In other words, our model does a reasonably good job of grouping visually similar photos of animals, but a poor job of grouping them by species.
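One way to move beyond a single anecdotal query is to score the retrieval over the whole set: for each image, check whether its nearest neighbour shares its class label. A hedged sketch with synthetic features (the variables here are hypothetical stand-ins; with the real notebook state one would use `gabor_stats` and `labels`):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical features: two well-separated classes of 50 images each
rng = np.random.RandomState(0)
feats = np.vstack([rng.normal(0, 1, (50, 8)),
                   rng.normal(4, 1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)

dmat = pairwise_distances(feats)
np.fill_diagonal(dmat, np.inf)          # never retrieve the query itself
nearest = dmat.argmin(axis=1)           # index of each image's closest match
accuracy = (labels[nearest] == labels).mean()
print(accuracy)
```

A low 1-NN accuracy on the real Gabor features would confirm numerically what the single query suggested: the distances capture visual similarity more than species.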
